Instructions

This file (hdat9600_final_assignment.Rmd) is the R Markdown document in which you need to complete your HDAT9600 final assignment. This assignment is assessed and will count for 30% of the total course marks. The assignment comprises two tasks worth 15 marks each. The first task will focus on logistic regression, and the second task will focus on survival analysis. There is no word limit, but a report of about 10 pages in length when printed (except that it will not be printed) is appropriate.

Don’t hesitate to ask the course convenor for help via OpenLearning. The course instructor are happy to point you in the right direction and to make suggestions, but they won’t, of course, complete your assignments for you!

Data for this assignment

The data used for this assignment consist of records from Intensive Care Unit (ICU) hospital stays in the USA. All patients were adults who were admitted for a wide variety of reasons. ICU stays of less than 48 hours have been excluded.

The source data for the assignment are data made freely available for the 2012 MIT PhysioNet/Computing for Cardiology Challenge. Details are provided here. Training Set A data have been used. The original data has been modified and assembled to suit the purpose of this assignment. While not required for the purposes of this assignment, full details of the preparatory work can be found in the hdat9600-final-assignment-data-preparation file.

The dataframe consists of 120 variables, which are defined as follows:

Patient Descriptor Variables

  • RecordID: a unique integer for each ICU stay
  • Age: years
  • Gender: male/female
  • Height: cm
  • ICUType: Coronary Care Unit; Cardiac Surgery Recovery Unit; Medical ICU; Surgical ICU
  • Length_of_stay: The number of days between the patient’s admission to the ICU and the end of hospitalisation
  • Survival: The number of days between ICU admission and death for patients who died
  • Outcome Variables

  • in_hospital_death: 0:survivor/1:died in-hospital this is the outcome variable for Task 1: Logistic Regression
  • Status: True/False this is the censoring variable for Task 2: Survival Analysis
  • Days: Length of survival (in days) this is the survival time variable for Task 2: Survival Analysis
  • Clinical Variables

    Use the hyperlinks below to find out more about the clinical meaning of each variable. The first two clinical variables are summary scores that are used to assess patient condition and risk.

  • SAPS-I score [Simplified Acute Physiological Score (Le Gall et al., 1984)]
  • SOFA score [Sequential Organ Failure Assessment (Ferreira et al., 2001)]
  • The following 36 clinical measures were assessed at multiple timepoints during each patient’s ICU stay. For each of the 36 clinical measures, you are given 3 summary variables: a) The minimum value during the first 24 hours in ICU (_min), b) The maximum value during the first 24 hours in ICU (_max), and c) The difference between the mean and the most extreme values during the first 24 hours in ICU (_diff). For example, for the clinical measure Cholesterol, these three variables are labelled ‘Cholesterol_min’, ‘Cholesterol_max’, and ‘Cholesterol_diff’.

  • Albumin (g/dL)
  • ALP [Alkaline phosphatase (IU/L)]
  • ALT [Alanine transaminase (IU/L)]
  • AST [Aspartate transaminase (IU/L)]
  • Bilirubin (mg/dL)
  • BUN [Blood urea nitrogen (mg/dL)]
  • Cholesterol (mg/dL)
  • Creatinine [Serum creatinine (mg/dL)]
  • DiasABP [Invasive diastolic arterial blood pressure (mmHg)]
  • FiO2 [Fractional inspired O2 (0-1)]
  • GCS [Glasgow Coma Score (3-15)]
  • Glucose [Serum glucose (mg/dL)]
  • HCO3 [Serum bicarbonate (mmol/L)]
  • HCT [Hematocrit (%)]
  • HR [Heart rate (bpm)]
  • K [Serum potassium (mEq/L)]
  • Lactate (mmol/L)
  • Mg [Serum magnesium (mmol/L)]
  • MAP [Invasive mean arterial blood pressure (mmHg)]
  • MechVent [Mechanical ventilation respiration (0:false, or 1:true)]
  • Na [Serum sodium (mEq/L)]
  • NIDiasABP [Non-invasive diastolic arterial blood pressure (mmHg)]
  • NIMAP [Non-invasive mean arterial blood pressure (mmHg)]
  • NISysABP [Non-invasive systolic arterial blood pressure (mmHg)]
  • PaCO2 [partial pressure of arterial CO2 (mmHg)]
  • PaO2 [Partial pressure of arterial O2 (mmHg)]
  • pH [Arterial pH (0-14)]
  • Platelets (cells/nL)
  • RespRate [Respiration rate (bpm)]
  • SaO2 [O2 saturation in hemoglobin (%)]
  • SysABP [Invasive systolic arterial blood pressure (mmHg)]
  • Temp [Temperature (°C)]
  • TropI [Troponin-I (μg/L)]
  • TropT [Troponin-T (μg/L)]
  • Urine [Urine output (mL)]
  • WBC [White blood cell count (cells/nL)]
  • Weight (kg)
  • Accessing the Data

    The data frame can be loaded with the following code:

    icu_patients_df0 <- readRDS("icu_patients_df0.rds")
    icu_patients_df1 <- readRDS("icu_patients_df1.rds")

    Note: icu_patients_df1 is an imputed (i.e. missing values are ‘derived’) version of icu_patients_df0. This assignment does not concern the methods used for imputation.

    Task 1 (15 marks)

    In this task, you are required to develop a logistic regression model using the icu_patients_df1 data set which adequately explains or predicts the in_hospital_death variable as the outcome using a subset of the available predictor variables. You should fit a series of models, evaluating each one, before you present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge. It is perfectly acceptable to include predictor variables in your final model which are not statistically significant, as long as you justify their inclusion on medical or physiological grounds (you will not be marked down if your medical justification is not exactly correct or complete, but do you best). Aim for between five and ten predictor variables (slightly more or fewer is OK). You should assess each model you consider for goodness of fit and other relevant statistics to help you choose between them. For your final model, present a set of diagnostic statistics and/or charts and comment on them. You don’t need to do an exhaustive exploratory data analysis of all the variables in the data set, but you should examine those variables that you use in your model. Finally, re-fit your final model to the unimputed data frame (icu_patients_df0.rds) and comment on any differences you find compared to the same model fitted to the imputed data.

    Hints

    1. Select an initial subset of explanatory variables that you will use to predict the risk of in-hospital death. Justify your choice.

    2. Conduct basic exploratory data analysis on your variables of choice.

    3. Fit appropriate univariate logistic regression models.

    4. Fit an appropriate series of multivariable logistic regression models, justifying your approach. Assess each model you consider for goodness of fit and other relevant statistics.

    5. Present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge.

    6. For your final model, present a set of diagnostic statistics and/or charts and comment on them.

    7. Write a paragraph summarising the most important findings of your final model. Include the most important values from the statistical output, and a simple clinical interpretation.

    Create your response to this task here, as a mixture of embedded (knitr) R code and any resulting outputs, and explanatory or commentary text.

    Regarding which variables to include in the model selection process we decided to first include the basic patient descriptor variables age and gender, as it is a good idea to inspect these regardless of the sort of their suspected significance. Furthermore, the type of ICU and length of stay were also included, as were both the SAPS-I score and the SOFA score. We hypothesized the ICU type may be an indicator of the severity of the patient’s condition, and thus a reasonable predictor of mortality. The SOFA AND SAPS-I scores are risk assessment measures, ergo they are an obvious choice to include for further inspection.

    We also included the minimum white blood cell count. A low white blood cell count is indicative of a weak immune system, a person with no white blood cells is incapable of fighting infection, and thus is more likely to die from a variety of conditions than someone with a sufficient amount of white blood cells. The maximum creatine level was included as this is a common measure of renal function. Fi02 was chosen as a candidate variable over Pa02 because you can titrate Fi02 to correct low Pa02 until later stages of disease. Maximum lactate was examined as it is often elevated in states of hypoperfusion. Extremes at the high and low ends of the pH scale often warrant a poor prognosis, thus both considered. Tropinon, a cardiac enzyme, is elevated in myocardial ischaemia. Finally, temperature and glucose levels, both at a maximum and minimum, were explored.

    Exploratory Data Analysis

    # Libraries
    library(dplyr)
    ## 
    ## Attaching package: 'dplyr'
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    library(ggplot2)
    library(DescTools)
    library(faraway)
    library(arm)
    ## Loading required package: MASS
    ## 
    ## Attaching package: 'MASS'
    ## The following object is masked from 'package:dplyr':
    ## 
    ##     select
    ## Loading required package: Matrix
    ## Loading required package: lme4
    ## 
    ## arm (Version 1.11-2, built: 2020-7-27)
    ## Working directory is /Users/joshbryden/Google Drive/UNSW/2020-T2/HDAT9600/HDAT9600_Group_Project
    ## 
    ## Attaching package: 'arm'
    ## The following objects are masked from 'package:faraway':
    ## 
    ##     fround, logit, pfround
    library(ROCR)
    # Reading in the data sets
    icu_patients_df0 <- readRDS("icu_patients_df0.rds")
    icu_patients_df1 <- readRDS("icu_patients_df1.rds")

    From the EDA below for patient descriptor variables it can be observed that the oldest patient in the data set was 90, the youngest 16, with the median age being 67. On average, the women admitted to the ICU were slightly older than the men, 65.5 years old compared to 63.5. However, by frequency there were more male patients (1148) than female patients (913).

    # EDA for the patient descriptor variables
    
    # -------------------- General summary statistics by variable ----------------#
    summary(icu_patients_df1$Age)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   16.00   52.00   67.00   64.41   78.00   90.00
    table(icu_patients_df1$Gender)
    ## 
    ## Female   Male 
    ##    913   1148
    # ----------------------  Age by gender ------------------------------------- #
    # Plot
    ggplot(data = icu_patients_df1, aes(x = Age, fill = Gender)) + geom_histogram() +
       facet_grid(Gender ~ .) + ggtitle("Distribution of age by gender")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # Summary Statistics
    group_by(icu_patients_df1, Gender) %>% summarize(mean_age = mean(Age))
    ## `summarise()` ungrouping output (override with `.groups` argument)
    ## # A tibble: 2 x 2
    ##   Gender mean_age
    ##   <fct>     <dbl>
    ## 1 Female     65.5
    ## 2 Male       63.5

    The clinical variables are SOFA and SAPS-I scores. SOFA stands for sequential organ failure assessment. It is a mortality prediction score based on the degree of dysfunction of six organ systems. The SAPS-I score is another mortality risk prediction measure that is calculated from 14 routine physiological measurements. Note that in the histograms below the vertical red line indicates the mean of the plotted measurement.

    # EDA for clinical variables
    
    # SOFA summary and plot
    summary(icu_patients_df1$SOFA)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##  -1.000   3.000   6.000   6.441   9.000  22.000
    ggplot(data = icu_patients_df1, aes(x = SOFA)) + geom_histogram(fill = "cornflowerblue") + 
       geom_vline(xintercept = mean(icu_patients_df1$SOFA), col = "red") +
       ggtitle("Distribution of SOFA scores")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # SAPS1 summary and plot
    summary(icu_patients_df1$SAPS1)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    ##    1.00   11.00   15.00   14.96   19.00   34.00      96
    ggplot(data = icu_patients_df1, aes(x = SAPS1)) + geom_histogram(fill = "cornflowerblue") +
       geom_vline(xintercept = mean(na.exclude(icu_patients_df1$SAPS1)), col = "red") +
       ggtitle("Distribution of SAPS1 scores")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    ## Warning: Removed 96 rows containing non-finite values (stat_bin).

    Brief EDA for the remaining clinical variables is presented below. Similarly to the above plots, the vertical line represents the mean for that particular measure. The distributions vary in shape, but structurally everything appears to be in order with the data entry of the measurements.

    # EDA for other variables
    
    # ICU type
    table(icu_patients_df1$ICUType)
    ## 
    ##            Coronary Care Unit Cardiac Surgery Recovery Unit 
    ##                           297                           448 
    ##                   Medical ICU                  Surgical ICU 
    ##                           788                           528
    # White blood cell (minimum) count
    summary(icu_patients_df1$WBC_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    0.10    7.60   10.40   11.51   14.10  128.30
    ggplot(data = icu_patients_df1, aes(x = WBC_min)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$WBC_min), col = "cornflowerblue") +
       ggtitle("Distribution of minimum white blooc cell counts")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # WBC diff count
    summary(icu_patients_df1$WBC_diff)
    ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    ##   0.03315   2.63315   4.53315   5.82079   7.23315 143.46685
    ggplot(data = icu_patients_df1, aes(x = WBC_diff)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$WBC_diff), col = "cornflowerblue") +
       ggtitle("Distribution of white blood cell count differences between mean and most extreme value")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # Creatine (max)
    summary(icu_patients_df1$Creatinine_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   0.200   0.800   1.000   1.499   1.500  22.000
    ggplot(data = icu_patients_df1, aes(x = Creatinine_max)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$Creatinine_max), col = "cornflowerblue") +
       ggtitle("Distribution of maximum creatinine measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # albumin (min)
    summary(icu_patients_df1$Albumin_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   1.100   2.600   3.000   3.012   3.500   5.300
    ggplot(data = icu_patients_df1, aes(x = Albumin_min)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$Albumin_min), col = "cornflowerblue") +
       ggtitle("Distribution of minimum albumin measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # Fi02
    summary(icu_patients_df1$FiO2_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##  0.2800  0.5000  1.0000  0.7874  1.0000  1.0000
    ggplot(data = icu_patients_df1, aes(x = FiO2_max)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$FiO2_max), col = "cornflowerblue") +
       ggtitle("Distribution of maximum FiO2 measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # lactate (max)
    summary(icu_patients_df1$Lactate_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   0.400   1.500   2.200   2.773   3.200  29.300
    ggplot(data = icu_patients_df1, aes(x = Lactate_max)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$Lactate_max), col = "cornflowerblue") +
       ggtitle("Distribution of maximum lactate measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # pH (max and min)
    summary(icu_patients_df1$pH_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   7.150   7.380   7.420   7.418   7.460   7.690
    ggplot(data = icu_patients_df1, aes(x = pH_max)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$pH_max), col = "cornflowerblue") +
       ggtitle("Distribution of maximum pH measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    summary(icu_patients_df1$pH_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   3.000   7.280   7.340   7.327   7.390   7.630
    ggplot(data = icu_patients_df1, aes(x = pH_min)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$pH_min), col = "cornflowerblue") +
       ggtitle("Distribution of minimum pH measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # Temperature(max and min)
    summary(icu_patients_df1$Temp_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   35.40   37.10   37.60   37.69   38.20   42.10
    ggplot(data = icu_patients_df1, aes(x = Temp_max)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$Temp_max), col = "cornflowerblue") +
       ggtitle("Distribution of maximum temperature measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    summary(icu_patients_df1$Temp_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   24.20   35.60   36.10   36.01   36.60   38.30
    ggplot(data = icu_patients_df1, aes(x = Temp_min)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$Temp_min), col = "cornflowerblue") +
       ggtitle("Distribution of minimum temperature measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # MAP (min and max)
    summary(icu_patients_df1$MAP_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##     4.0    94.0   104.0   111.8   117.0   291.0
    ggplot(data = icu_patients_df1, aes(x = MAP_max)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$MAP_max), col = "cornflowerblue") +
       ggtitle("Distribution of maximum MAP measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    summary(icu_patients_df1$MAP_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    1.00   55.00   61.00   62.76   70.00  265.00
    ggplot(data = icu_patients_df1, aes(x = MAP_min)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$MAP_min), col = "cornflowerblue") +
       ggtitle("Distribution of minimum MAP measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # Glusoce (max and min)
    summary(icu_patients_df1$Glucose_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    39.0   117.0   141.0   163.3   180.0  1143.0
    ggplot(data = icu_patients_df1, aes(x = Glucose_max)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$Glucose_max), col = "cornflowerblue") +
       ggtitle("Distribution of maximum glusoce measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    summary(icu_patients_df1$Glucose_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    24.0    98.0   117.0   124.8   141.0   632.0
    ggplot(data = icu_patients_df1, aes(x = Glucose_min)) + geom_histogram(fill = "salmon1") +
       geom_vline(xintercept = mean(icu_patients_df1$Glucose_min), col = "cornflowerblue") +
       ggtitle("Distribution of minimum glusoce measurements")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    Model Fitting

    Next, we fit each candidate variable as a univariate predictor of in hospital mortality. Out of the patient descriptor variables age was found to be significant while gender was not. Both clinical risk predictions scores were found to be significant, as was ICU type. Among the additional clinical measures we chose as candidate variables only FiO2 (max), pH (max), temperature (max), MAP (max and min), and glucose (min) were found to not be significant.

    # Fitting univariate models for the predictors chosen so far
    
    # Univariate fits
    age_fit        <- glm(in_hospital_death ~ Age, family  = "binomial", data = icu_patients_df1)
    gender_fit     <- glm(in_hospital_death ~ Gender, family = "binomial", data = icu_patients_df1)
    SAPS_fit       <- glm(in_hospital_death ~ SAPS1, family = "binomial", data = icu_patients_df1)
    SOFA_fit       <- glm(in_hospital_death ~ SOFA, family = "binomial", data = icu_patients_df1)
    WBCmin_fit     <- glm(in_hospital_death ~ WBC_min, family = "binomial", data = icu_patients_df1)
    WBCdiff_fit    <- glm(in_hospital_death ~ WBC_diff, family = "binomial", data = icu_patients_df1)
    creatine_fit   <- glm(in_hospital_death ~ Creatinine_max, family = "binomial", data = icu_patients_df1)
    albumin_fit    <- glm(in_hospital_death ~ Albumin_min, family = "binomial", data = icu_patients_df1)
    fio2_fit       <- glm(in_hospital_death ~ FiO2_max, family = "binomial", data = icu_patients_df1)
    lactate_fit    <- glm(in_hospital_death ~ Lactate_max, family = "binomial", data = icu_patients_df1)
    pHmax_fit      <- glm(in_hospital_death ~ pH_max, family = "binomial", data = icu_patients_df1)
    pHmin_fit      <- glm(in_hospital_death ~ pH_min, family = "binomial", data = icu_patients_df1)
    tempmax_fit    <- glm(in_hospital_death ~ Temp_max, family = "binomial", data = icu_patients_df1)
    tempmin_fit    <- glm(in_hospital_death ~ Temp_min, family = "binomial", data = icu_patients_df1)
    mapmax_fit     <- glm(in_hospital_death ~ MAP_max, family = "binomial", data = icu_patients_df1)
    mapmin_fit     <- glm(in_hospital_death ~ MAP_min, family = "binomial", data = icu_patients_df1)
    glucosemax_fit <- glm(in_hospital_death ~ Glucose_max, family = "binomial", data = icu_patients_df1)
    glucosemin_fit <- glm(in_hospital_death ~ Glucose_min, family = "binomial", data = icu_patients_df1)
    ICUtype_fit    <- glm(in_hospital_death ~ ICUType, family = "binomial", data = icu_patients_df1)
    
    # Summaries
    summary(age_fit)        # significant, AIC = 1648.9
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Age, family = "binomial", data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.7522  -0.6264  -0.5111  -0.3919   2.5135  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -3.761624   0.303337 -12.401  < 2e-16 ***
    ## Age          0.029376   0.004229   6.947 3.73e-12 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1644.9  on 2059  degrees of freedom
    ## AIC: 1648.9
    ## 
    ## Number of Fisher Scoring iterations: 5
    summary(gender_fit)     # non significant
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Gender, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.5612  -0.5612  -0.5553  -0.5553   1.9728  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -1.76894    0.09381 -18.856   <2e-16 ***
    ## GenderMale  -0.02281    0.12615  -0.181    0.856    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1699.7  on 2059  degrees of freedom
    ## AIC: 1703.7
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(SAPS_fit)       # significant, AIC = 1534.8
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ SAPS1, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.3834  -0.5894  -0.4662  -0.3448   2.6278  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -3.79727    0.23888 -15.896   <2e-16 ***
    ## SAPS1        0.12558    0.01338   9.384   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1627.0  on 1964  degrees of freedom
    ## Residual deviance: 1530.8  on 1963  degrees of freedom
    ##   (96 observations deleted due to missingness)
    ## AIC: 1534.8
    ## 
    ## Number of Fisher Scoring iterations: 5
    summary(SOFA_fit)       # significant, AIC = 1611.5
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ SOFA, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.0747  -0.5835  -0.4771  -0.3623   2.4609  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -2.83453    0.14090 -20.117   <2e-16 ***
    ## SOFA         0.14378    0.01539   9.342   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1607.5  on 2059  degrees of freedom
    ## AIC: 1611.5
    ## 
    ## Number of Fisher Scoring iterations: 5
    summary(WBCmin_fit)     # significant, AIC = 1699.6
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ WBC_min, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.2861  -0.5638  -0.5477  -0.5315   2.0563  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -1.987167   0.119356 -16.649   <2e-16 ***
    ## WBC_min      0.017452   0.008437   2.068   0.0386 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1695.6  on 2059  degrees of freedom
    ## AIC: 1699.6
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(WBCdiff_fit)    # significant, AIC = 1694.7 ,ONLY CHOOSE WBC_MIN OR WBC_DIFF TO AVOID MULTICOLLINEARITY
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ WBC_diff, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.9566  -0.5607  -0.5420  -0.5273   2.0360  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -1.939861   0.084336 -23.002  < 2e-16 ***
    ## WBC_diff     0.025751   0.008809   2.923  0.00346 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1690.7  on 2059  degrees of freedom
    ## AIC: 1694.7
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(creatine_fit)   # significant, AIC = 1678.4
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Creatinine_max, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.8627  -0.5433  -0.5270  -0.5151   2.0633  
    ## 
    ## Coefficients:
    ##                Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)    -2.05087    0.08430 -24.328  < 2e-16 ***
    ## Creatinine_max  0.16325    0.03135   5.208 1.91e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1674.4  on 2059  degrees of freedom
    ## AIC: 1678.4
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(albumin_fit)    # significant, AIC = 1680
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Albumin_min, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.7948  -0.5887  -0.5385  -0.4595   2.2842  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -0.36392    0.29389  -1.238    0.216    
    ## Albumin_min -0.48186    0.09987  -4.825  1.4e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1676.0  on 2059  degrees of freedom
    ## AIC: 1680
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(fio2_fit)       # not significant
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ FiO2_max, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.5634  -0.5634  -0.5555  -0.5504   1.9898  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)  -1.8614     0.2102  -8.856   <2e-16 ***
    ## FiO2_max      0.1011     0.2533   0.399     0.69    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1699.5  on 2059  degrees of freedom
    ## AIC: 1703.5
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(lactate_fit)    # significant, AIC = 1673.5
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Lactate_max, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.1726  -0.5544  -0.5200  -0.4939   2.1212  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)  -2.1932     0.1005 -21.820  < 2e-16 ***
    ## Lactate_max   0.1372     0.0244   5.625 1.86e-08 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1669.5  on 2059  degrees of freedom
    ## AIC: 1673.5
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(pHmax_fit)      # not significant
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ pH_max, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.6684  -0.5677  -0.5523  -0.5297   2.0743  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)   9.3001     7.0197   1.325    0.185
    ## pH_max       -1.4944     0.9469  -1.578    0.115
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1697.2  on 2059  degrees of freedom
    ## AIC: 1701.2
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(pHmin_fit)      # significant, AIC = 1681.4
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ pH_min, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.9980  -0.5733  -0.5358  -0.4868   2.2874  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)  19.5912     4.8996   3.998 6.37e-05 ***
    ## pH_min       -2.9197     0.6699  -4.358 1.31e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1677.4  on 2059  degrees of freedom
    ## AIC: 1681.4
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(tempmax_fit)    # not significant
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Temp_max, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.6077  -0.5689  -0.5549  -0.5366   2.1386  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)  1.60419    3.08152   0.521    0.603
    ## Temp_max    -0.08988    0.08183  -1.098    0.272
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1698.5  on 2059  degrees of freedom
    ## AIC: 1702.5
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(tempmin_fit)    # significant, AIC = 1688.3
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Temp_min, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.8918  -0.5741  -0.5409  -0.4973   2.2040  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)   7.4599     2.3473   3.178  0.00148 ** 
    ## Temp_min     -0.2571     0.0654  -3.931 8.45e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1684.3  on 2059  degrees of freedom
    ## AIC: 1688.3
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(mapmax_fit)     # not significant, AIC = 1703.7
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ MAP_max, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.5583  -0.5579  -0.5579  -0.5578   1.9695  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -1.779817   0.215732  -8.250   <2e-16 ***
    ## MAP_max     -0.000016   0.001846  -0.009    0.993    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1699.7  on 2059  degrees of freedom
    ## AIC: 1703.7
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(mapmin_fit)     # not significant
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ MAP_min, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.6583  -0.5674  -0.5551  -0.5341   2.4214  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -1.413112   0.257434  -5.489 4.04e-08 ***
    ## MAP_min     -0.005926   0.004051  -1.463    0.143    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1697.4  on 2059  degrees of freedom
    ## AIC: 1701.4
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(glucosemax_fit) # significant, AIC = 1688.2
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Glucose_max, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.4117  -0.5572  -0.5343  -0.5162   2.0872  
    ## 
    ## Coefficients:
    ##               Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -2.1865370  0.1202802 -18.179  < 2e-16 ***
    ## Glucose_max  0.0023817  0.0005819   4.093 4.25e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1684.2  on 2059  degrees of freedom
    ## AIC: 1688.2
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(glucosemin_fit) # not significant 
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Glucose_min, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.7799  -0.5613  -0.5522  -0.5428   2.0271  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -1.967537   0.171241 -11.490   <2e-16 ***
    ## Glucose_min  0.001476   0.001253   1.178    0.239    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1698.4  on 2059  degrees of freedom
    ## AIC: 1702.4
    ## 
    ## Number of Fisher Scoring iterations: 4
    summary(ICUtype_fit) # significant, AIC = 1663.3
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ ICUType, family = "binomial", 
    ##     data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -0.6402  -0.6402  -0.5615  -0.3458   2.3861  
    ## 
    ## Coefficients:
    ##                                      Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)                           -1.6463     0.1576 -10.443  < 2e-16 ***
    ## ICUTypeCardiac Surgery Recovery Unit  -1.1407     0.2563  -4.451 8.55e-06 ***
    ## ICUTypeMedical ICU                     0.1653     0.1824   0.906    0.365    
    ## ICUTypeSurgical ICU                   -0.1214     0.2001  -0.607    0.544    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1699.7  on 2060  degrees of freedom
    ## Residual deviance: 1655.3  on 2057  degrees of freedom
    ## AIC: 1663.3
    ## 
    ## Number of Fisher Scoring iterations: 5

    From the univariate fits we chose to further examine a number of models. The first of which contained all predictors significant in the univariate analyses, as well as gender. From there, given that all candidate variables have been deemed either statistically and/or clinically significant, we employed backwards elimination to minimize AIC and find the optimal model. Judging by AIC, BIC, and other metric provided by the PseudoR2 function it can be observed that the model fit using backwards elimination is preferable. Thus, our final model included the predictors: age, SAPS1, SOFA, maximum lactate measurement, as well as ICU type. While this model may seem rather simple, it should be noted that the SOFA and SAPS1 scores are derived from many of the clinical measurements in the data set and thus their importance is captured in the background.

    # Skeleton code for picking the next few models
    
    # Candidate models from above variables
    cand_1_fit <- glm(in_hospital_death ~ Age + Gender + SAPS1 + SOFA + WBC_diff + Creatinine_max + Albumin_min + Lactate_max + 
                         pH_min + Temp_min + Glucose_max + ICUType, family = "binomial", data = icu_patients_df1)
    
    cand_backelim_fit <- step(cand_1_fit, direction = "back")
    ## Start:  AIC=1421.26
    ## in_hospital_death ~ Age + Gender + SAPS1 + SOFA + WBC_diff + 
    ##     Creatinine_max + Albumin_min + Lactate_max + pH_min + Temp_min + 
    ##     Glucose_max + ICUType
    ## 
    ##                  Df Deviance    AIC
    ## - Glucose_max     1   1391.3 1419.3
    ## - Gender          1   1391.4 1419.4
    ## - WBC_diff        1   1391.8 1419.8
    ## - Temp_min        1   1392.1 1420.1
    ## - Creatinine_max  1   1393.0 1421.0
    ## - pH_min          1   1393.0 1421.0
    ## - Lactate_max     1   1393.2 1421.2
    ## <none>                1391.3 1421.3
    ## - Albumin_min     1   1393.9 1421.9
    ## - SOFA            1   1402.9 1430.9
    ## - SAPS1           1   1403.8 1431.8
    ## - Age             1   1432.2 1460.2
    ## - ICUType         3   1459.7 1483.7
    ## 
    ## Step:  AIC=1419.28
    ## in_hospital_death ~ Age + Gender + SAPS1 + SOFA + WBC_diff + 
    ##     Creatinine_max + Albumin_min + Lactate_max + pH_min + Temp_min + 
    ##     ICUType
    ## 
    ##                  Df Deviance    AIC
    ## - Gender          1   1391.5 1417.5
    ## - WBC_diff        1   1391.8 1417.8
    ## - Temp_min        1   1392.2 1418.2
    ## - Creatinine_max  1   1393.0 1419.0
    ## - pH_min          1   1393.0 1419.0
    ## <none>                1391.3 1419.3
    ## - Lactate_max     1   1393.4 1419.4
    ## - Albumin_min     1   1393.9 1419.9
    ## - SOFA            1   1402.9 1428.9
    ## - SAPS1           1   1404.5 1430.5
    ## - Age             1   1432.2 1458.2
    ## - ICUType         3   1464.2 1486.2
    ## 
    ## Step:  AIC=1417.45
    ## in_hospital_death ~ Age + SAPS1 + SOFA + WBC_diff + Creatinine_max + 
    ##     Albumin_min + Lactate_max + pH_min + Temp_min + ICUType
    ## 
    ##                  Df Deviance    AIC
    ## - WBC_diff        1   1392.0 1416.0
    ## - Temp_min        1   1392.3 1416.3
    ## - pH_min          1   1393.2 1417.2
    ## - Creatinine_max  1   1393.3 1417.3
    ## <none>                1391.5 1417.5
    ## - Lactate_max     1   1393.6 1417.6
    ## - Albumin_min     1   1394.0 1418.0
    ## - SOFA            1   1403.4 1427.4
    ## - SAPS1           1   1404.5 1428.5
    ## - Age             1   1432.2 1456.2
    ## - ICUType         3   1464.2 1484.2
    ## 
    ## Step:  AIC=1415.97
    ## in_hospital_death ~ Age + SAPS1 + SOFA + Creatinine_max + Albumin_min + 
    ##     Lactate_max + pH_min + Temp_min + ICUType
    ## 
    ##                  Df Deviance    AIC
    ## - Temp_min        1   1392.9 1414.9
    ## - pH_min          1   1393.6 1415.6
    ## - Creatinine_max  1   1393.8 1415.8
    ## <none>                1392.0 1416.0
    ## - Lactate_max     1   1394.1 1416.1
    ## - Albumin_min     1   1394.4 1416.4
    ## - SOFA            1   1404.1 1426.1
    ## - SAPS1           1   1404.6 1426.6
    ## - Age             1   1433.4 1455.4
    ## - ICUType         3   1464.2 1482.2
    ## 
    ## Step:  AIC=1414.88
    ## in_hospital_death ~ Age + SAPS1 + SOFA + Creatinine_max + Albumin_min + 
    ##     Lactate_max + pH_min + ICUType
    ## 
    ##                  Df Deviance    AIC
    ## - pH_min          1   1394.6 1414.6
    ## - Creatinine_max  1   1394.7 1414.7
    ## <none>                1392.9 1414.9
    ## - Albumin_min     1   1395.4 1415.4
    ## - Lactate_max     1   1395.7 1415.7
    ## - SOFA            1   1404.8 1424.8
    ## - SAPS1           1   1406.9 1426.9
    ## - Age             1   1434.9 1454.9
    ## - ICUType         3   1464.3 1480.3
    ## 
    ## Step:  AIC=1414.62
    ## in_hospital_death ~ Age + SAPS1 + SOFA + Creatinine_max + Albumin_min + 
    ##     Lactate_max + ICUType
    ## 
    ##                  Df Deviance    AIC
    ## - Creatinine_max  1   1396.5 1414.5
    ## <none>                1394.6 1414.6
    ## - Albumin_min     1   1397.0 1415.0
    ## - Lactate_max     1   1398.2 1416.2
    ## - SOFA            1   1408.5 1426.5
    ## - SAPS1           1   1408.8 1426.8
    ## - Age             1   1435.9 1453.9
    ## - ICUType         3   1466.4 1480.4
    ## 
    ## Step:  AIC=1414.51
    ## in_hospital_death ~ Age + SAPS1 + SOFA + Albumin_min + Lactate_max + 
    ##     ICUType
    ## 
    ##               Df Deviance    AIC
    ## - Albumin_min  1   1398.5 1414.5
    ## <none>             1396.5 1414.5
    ## - Lactate_max  1   1399.8 1415.8
    ## - SAPS1        1   1410.9 1426.9
    ## - SOFA         1   1413.8 1429.8
    ## - Age          1   1437.1 1453.1
    ## - ICUType      3   1476.3 1488.3
    ## 
    ## Step:  AIC=1414.47
    ## in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max + ICUType
    ## 
    ##               Df Deviance    AIC
    ## <none>             1398.5 1414.5
    ## - Lactate_max  1   1402.0 1416.0
    ## - SAPS1        1   1412.9 1426.9
    ## - SOFA         1   1419.0 1433.0
    ## - Age          1   1439.2 1453.2
    ## - ICUType      3   1481.9 1491.9
    # Summaries
    summary(cand_1_fit)
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Age + Gender + SAPS1 + SOFA + 
    ##     WBC_diff + Creatinine_max + Albumin_min + Lactate_max + pH_min + 
    ##     Temp_min + Glucose_max + ICUType, family = "binomial", data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.4315  -0.5690  -0.3992  -0.2439   2.9331  
    ## 
    ## Coefficients:
    ##                                        Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)                           1.7125053  4.2517991   0.403 0.687116    
    ## Age                                   0.0293566  0.0048130   6.099 1.06e-09 ***
    ## GenderMale                            0.0578426  0.1420371   0.407 0.683835    
    ## SAPS1                                 0.0716758  0.0203541   3.521 0.000429 ***
    ## SOFA                                  0.0794779  0.0235266   3.378 0.000730 ***
    ## WBC_diff                             -0.0075925  0.0112084  -0.677 0.498157    
    ## Creatinine_max                        0.0553524  0.0413612   1.338 0.180809    
    ## Albumin_min                          -0.1895045  0.1165692  -1.626 0.104017    
    ## Lactate_max                           0.0436155  0.0312626   1.395 0.162976    
    ## pH_min                               -0.5661247  0.4512019  -1.255 0.209586    
    ## Temp_min                             -0.0701843  0.0760048  -0.923 0.355789    
    ## Glucose_max                           0.0001117  0.0007467   0.150 0.881065    
    ## ICUTypeCardiac Surgery Recovery Unit -1.5891191  0.2862601  -5.551 2.84e-08 ***
    ## ICUTypeMedical ICU                    0.1845113  0.2057159   0.897 0.369760    
    ## ICUTypeSurgical ICU                  -0.0859675  0.2251218  -0.382 0.702557    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1627.0  on 1964  degrees of freedom
    ## Residual deviance: 1391.3  on 1950  degrees of freedom
    ##   (96 observations deleted due to missingness)
    ## AIC: 1421.3
    ## 
    ## Number of Fisher Scoring iterations: 5
    summary(cand_backelim_fit)
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max + 
    ##     ICUType, family = "binomial", data = icu_patients_df1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.4353  -0.5710  -0.4000  -0.2492   2.9116  
    ## 
    ## Coefficients:
    ##                                       Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)                          -5.570809   0.430858 -12.930  < 2e-16 ***
    ## Age                                   0.028988   0.004763   6.087 1.15e-09 ***
    ## SAPS1                                 0.071969   0.019024   3.783 0.000155 ***
    ## SOFA                                  0.098617   0.022035   4.475 7.63e-06 ***
    ## Lactate_max                           0.055024   0.029396   1.872 0.061228 .  
    ## ICUTypeCardiac Surgery Recovery Unit -1.634703   0.275855  -5.926 3.11e-09 ***
    ## ICUTypeMedical ICU                    0.200727   0.202790   0.990 0.322259    
    ## ICUTypeSurgical ICU                  -0.102964   0.220491  -0.467 0.640518    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1627.0  on 1964  degrees of freedom
    ## Residual deviance: 1398.5  on 1957  degrees of freedom
    ##   (96 observations deleted due to missingness)
    ## AIC: 1414.5
    ## 
    ## Number of Fisher Scoring iterations: 5
    # Comparing the two
    PseudoR2(cand_1_fit, which = "all")
    ##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
    ##       0.1449129       0.1264745       0.1130701       0.2008057       0.1071343 
    ## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
    ##       0.2365221       0.1390483       0.2642414       0.1372283    1421.2600331 
    ##             BIC          logLik         logLik0              G2 
    ##    1505.0087460    -695.6300166    -813.5195273     235.7790214
    PseudoR2(cand_backelim_fit, which = "all")
    ##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
    ##       0.1404791       0.1306452       0.1098079       0.1950123       0.1041979 
    ## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
    ##       0.2300393       0.1356312       0.2593508       0.1328953    1414.4741522 
    ##             BIC          logLik         logLik0              G2 
    ##    1459.1401324    -699.2370761    -813.5195273     228.5649023

    Model Diagnostics

    When diagnosing the quality of the model we first examined the binned residuals plot. Initially, slight alarm was raised over the residuals on the leftmost side of the bottom half of the plot. However, the residuals seem to broadly be acceptable. The halfnorm plot indicated two values of concern. Upon further investigation these patients had normal measurements for the variables used in our final model. Given the large size of the data these two patients should not be of much concern.

    # Renaming final model
    final_model <- cand_backelim_fit
    
    # Residual checks
    binnedplot(predict(final_model), residuals(final_model))

    # Outlier checks
    halfnorm(hatvalues(final_model))

    # Inspecting suspect values
    icu_patients_df1[hatvalues(final_model) > 0.04, c("in_hospital_death", "Age", "ICUType", "SOFA", "SAPS1")]
    ##      in_hospital_death Age     ICUType SOFA SAPS1
    ## 162                  1  71 Medical ICU   -1    13
    ## 1597                 0  77 Medical ICU   12    24
    # ROC curve
    #pred_prob = predict(final_model, type = "response")
    #label_vec <- na.exclude(icu_patients_df1$in_hospital_death)
    #pred <- prediction(pred_prob, label_vec, label.ordering = NULL)
    #perf <- performance(pred, measure = "tpr", x.measure = "fpr")
    #plot(perf, col = rainbow(10))

    When interpreting the model it is best to default to the odds, thus, we first exponentiated the repression coefficients. The interpretation of such odds are as follows: the odds of dying in the ICU are 2.9% greater for every 1 year increase in age, 7.4% greater for every unit increase in SAPS1 score, 10.3% higher for every unit increase in SOFA score, 5.6% greater for every unit increase in maximum lactate measurement, 80.5% lower if you are in the surgical recovery unit compared to baseline (coronary care unit), 22.2% higher if you are in the medical ICU compared to baseline, and about 10% lower if you are in the surgical ICU compared to baseline.

    final_coef <- coef(final_model)
    exp(final_coef)
    ##                          (Intercept)                                  Age 
    ##                            0.0038074                            1.0294126 
    ##                                SAPS1                                 SOFA 
    ##                            1.0746216                            1.1036440 
    ##                          Lactate_max ICUTypeCardiac Surgery Recovery Unit 
    ##                            1.0565663                            0.1950102 
    ##                   ICUTypeMedical ICU                  ICUTypeSurgical ICU 
    ##                            1.2222905                            0.9021597

    After fitting the model the the unimputed data set we can see that the AIC is actually much lower, and more variables are significant at the 0.05 level.

    # Fitting final model to second ICU data set
    final_model_newdata <- glm(in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max + ICUType, data = icu_patients_df0)
    summary(final_model_newdata)
    ## 
    ## Call:
    ## glm(formula = in_hospital_death ~ Age + SAPS1 + SOFA + Lactate_max + 
    ##     ICUType, data = icu_patients_df0)
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -0.55141  -0.20650  -0.10595   0.01153   1.12886  
    ## 
    ## Coefficients:
    ##                                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                          -0.2354064  0.0648056  -3.633 0.000296 ***
    ## Age                                   0.0027118  0.0007015   3.866 0.000118 ***
    ## SAPS1                                 0.0083246  0.0030865   2.697 0.007117 ** 
    ## SOFA                                  0.0154750  0.0036109   4.286 2.01e-05 ***
    ## Lactate_max                           0.0166539  0.0048162   3.458 0.000568 ***
    ## ICUTypeCardiac Surgery Recovery Unit -0.2382688  0.0451132  -5.282 1.59e-07 ***
    ## ICUTypeMedical ICU                   -0.0268378  0.0408673  -0.657 0.511528    
    ## ICUTypeSurgical ICU                  -0.0901556  0.0422844  -2.132 0.033251 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for gaussian family taken to be 0.1220829)
    ## 
    ##     Null deviance: 136.76  on 963  degrees of freedom
    ## Residual deviance: 116.71  on 956  degrees of freedom
    ##   (1097 observations deleted due to missingness)
    ## AIC: 718.34
    ## 
    ## Number of Fisher Scoring iterations: 2

    Task 2 (15 marks)

    In this task, you are required to develop a Cox proportional hazards survival model using the icu_patients_df1 data set which adequately explains or predicts the length of survival indicated by the Days variable, with censoring as indicated by the Status variable. You should fit a series of models, maybe three or four, evaluating each one, before you present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge. Aim for between five and ten predictor variables (slightly more or fewer is OK). It is perfectly acceptable to include predictor variables in your final model which are not statistically significant, as long as you justify their inclusion on medical or physiological grounds (you will not be marked down if your medical justification is not exactly correct, but do you best). You should assess each model you consider for goodness of fit and other relevant statistics, and you should assess your final model for violations of assumptions and perform other diagnostics which you think are relevant (and modify the model if indicated, or at least comment on the possible impact of what your diagnostics show). Finally, re-fit your final model to the unimputed data frame (icu_patients_df0.rds) and comment on any differences you find.

    Hints

    1. Select an initial subset of explanatory variables that you will use to predict survival. Justify your choice.

    2. Conduct basic exploratory data analysis on your variables of choice.

    3. Fit appropriate univariate Cox proportional hazards models.

    4. Fit an appropriate series of multivariable Cox proportional hazards models, justifying your approach. Assess each model you consider for goodness of fit and other relevant statistics.

    5. Present your final model. Your final model should not include all the predictor variables, just a small subset of them, which you have selected based on statistical significance and/or background knowledge.

    6. For your final model, present a set of diagnostic statistics and/or charts and comment on them.

    7. Write a very brief paragraph summarising the most important findings of your final model. Include the most important values from the statistical output, and a simple clinical interpretation.

    Create your response to this task here, as a mixture of embedded (knitr) R code and any resulting outputs, and explanatory or commentary text.

    Task 2 EDA

    # basic EDA in task2
    
    # load the survival library
    library(survival)
    ## 
    ## Attaching package: 'survival'
    ## The following objects are masked from 'package:faraway':
    ## 
    ##     rats, solder
    # read the dataset
    head(icu_patients_df1)
    ##   RecordID Length_of_stay SAPS1 SOFA Survival in_hospital_death Days Status Age
    ## 1   132539              5     6    1       NA                 0 2408  FALSE  54
    ## 2   132540              8    16    8       NA                 0 2408  FALSE  76
    ## 3   132541             19    21   11       NA                 0 2408  FALSE  44
    ## 4   132543              9     7    1      575                 0  575   TRUE  68
    ## 5   132545              4    17    2      918                 0  918   TRUE  88
    ## 6   132547              6    14   11     1637                 0 1637   TRUE  64
    ##   Albumin_diff Albumin_max Albumin_min   ALP_diff ALP_max ALP_min  ALT_diff
    ## 1    0.2186633         3.2         3.1 118.147964     214     202  80.44617
    ## 2    0.8813367         2.1         2.2 252.147964     338     348  94.44617
    ## 3    0.6813367         2.7         2.3  31.147964     127     105  45.44617
    ## 4    1.4186633         4.4         4.4   9.147964     105     105 108.44617
    ## 5    0.3813367         2.7         2.6  56.852036      39      78  96.44617
    ## 6    0.4186633         3.4         3.3   5.147964     101     101  75.44617
    ##   ALT_max ALT_min  AST_diff AST_max AST_min Bilirubin_diff Bilirubin_max
    ## 1      40      75 131.35271      38      53       1.464039           0.4
    ## 2     206      26 116.35271      53      74       1.564039           1.2
    ## 3      91      75  65.64729     235     164       1.235961           3.0
    ## 4      12      12 154.35271      15      15       1.564039           0.2
    ## 5      24      32 154.35271      15      97       1.364039           0.4
    ## 6      60      45 122.35271     162      47       1.364039           0.4
    ##   Bilirubin_min  BUN_diff BUN_max BUN_min Cholesterol_diff Cholesterol_max
    ## 1           0.3 11.527053      13      13         16.42276             154
    ## 2           0.2  8.527053      18      16         28.42276             139
    ## 3           2.8 21.527053       8       3         56.42276             111
    ## 4           0.2  4.527053      23      20         37.42276             127
    ## 5           0.9 20.472947      45      45         55.42276             104
    ## 6           0.4  9.527053      19      15         55.57724             212
    ##   Cholesterol_min Creatinine_diff Creatinine_max Creatinine_min DiasABP_diff
    ## 1             140       0.4324463            0.8            0.8           NA
    ## 2             128       0.4324463            1.2            0.8     26.54421
    ## 3             100       0.9324463            0.4            0.3           NA
    ## 4             119       0.5324463            0.9            0.7           NA
    ## 5             101       0.2324463            1.0            1.0           NA
    ## 6             212       0.3324463            1.4            0.9     20.45579
    ##   DiasABP_max DiasABP_min  FiO2_diff FiO2_max FiO2_min GCS_diff GCS_max GCS_min
    ## 1          NA          NA 0.05192012      0.5      0.5 3.755971      15      15
    ## 2          81          32 0.44807988      1.0      0.4 8.244029      15       3
    ## 3          NA          NA 0.44807988      1.0      0.5 6.244029       8       5
    ## 4          NA          NA 0.44807988      1.0      0.4 3.755971      15      14
    ## 5          NA          NA 0.15192012      0.4      0.5 3.755971      15      15
    ## 6          79          55 0.05192012      0.5      0.5 4.244029       9       7
    ##   Gender Glucose_diff Glucose_max Glucose_min HCO3_diff HCO3_max HCO3_min
    ## 1 Female     65.14446         205         205  3.227452       26       26
    ## 2   Male     34.85554         105         105  1.772548       22       21
    ## 3 Female     20.85554         141         119  3.227452       26       24
    ## 4   Male     33.85554         129         106  5.227452       28       27
    ## 5 Female     26.85554         113         113  4.772548       18       18
    ## 6   Male    124.14446         264         197  3.772548       19       19
    ##    HCT_diff HCT_max HCT_min Height   HR_diff HR_max HR_min
    ## 1  2.739871    33.7    33.5     NA 29.077891     80     58
    ## 2  6.260129    29.7    24.7  175.3  7.077891     88     80
    ## 3  4.260129    28.5    26.7     NA 30.077891    113     57
    ## 4 10.339871    41.3    36.1  180.3 30.077891     88     57
    ## 5  8.360129    30.8    22.6     NA 20.077891     94     67
    ## 6 10.639871    41.6    36.8  180.3 16.077891     91     71
    ##                         ICUType    K_diff K_max K_min Lactate_diff Lactate_max
    ## 1                  Surgical ICU 0.2647934   4.4   4.4    0.9964037         1.9
    ## 2 Cardiac Surgery Recovery Unit 0.1647934   4.3   4.3    1.4964037         2.9
    ## 3                   Medical ICU 4.4647934   8.6   3.3    1.4964037         1.9
    ## 4                   Medical ICU 0.1352066   4.2   4.0    1.5964037         1.2
    ## 5                   Medical ICU 1.8647934   6.0   3.8    0.8964037         2.0
    ## 6            Coronary Care Unit 0.9647934   5.1   3.8    1.8964037         0.9
    ##   Lactate_min MAP_diff MAP_max MAP_min   Mg_diff Mg_max Mg_min   Na_diff Na_max
    ## 1         1.8 31.23164     109      56 0.4842982    1.5    1.5 2.2066071    137
    ## 2         1.3 34.76836     100      43 1.1157018    3.1    1.9 0.2066071    139
    ## 3         1.3 53.23164     131      71 0.6842982    1.9    1.3 2.2066071    140
    ## 4         1.5 24.23164     102      72 0.1157018    2.1    2.1 1.7933929    141
    ## 5         1.9  9.76836      78      68 0.4842982    1.5    1.5 0.7933929    140
    ## 6         1.3 24.23164     102      62 0.2842982    1.7    1.7 2.2066071    141
    ##   Na_min NIDiasABP_diff NIDiasABP_max NIDiasABP_min NIMAP_diff NIMAP_max
    ## 1    137       17.49101            65            40   17.04069     92.33
    ## 2    139       19.49101            65            38   26.38069     86.33
    ## 3    137       37.50899            95            66   34.28931    110.00
    ## 4    140       23.50899            81            54   24.98931    100.70
    ## 5    140       38.50899            96            29   29.98931    105.70
    ## 6    137       31.50899            89            52   26.58931    102.30
    ##   NIMAP_min NISysABP_diff NISysABP_max NISysABP_min PaCO2_diff PaCO2_max
    ## 1     58.67      40.30125          157           96   3.335797        37
    ## 2     49.33      44.69875          129           72   7.335797        41
    ## 3     83.33      33.30125          150          111   3.335797        37
    ## 4     73.00      23.30125          140          102   9.335797        38
    ## 5     63.67      39.30125          156          119   6.335797        34
    ## 6     61.67      35.69875          129           81   5.335797        45
    ##   PaCO2_min PaO2_diff PaO2_max PaO2_min    pH_diff pH_max pH_min Platelets_diff
    ## 1        38  47.61789      186      111 0.12011376   7.49   7.43       31.23069
    ## 2        33 286.38211      445       89 0.08011376   7.45   7.34       36.23069
    ## 3        37  93.61789       65       65 0.14011376   7.51   7.51      117.76931
    ## 4        31  94.61789      148       64 0.14011376   7.51   7.47      201.23069
    ## 5        35  80.61789       78       84 0.04011376   7.38   7.41       80.76931
    ## 6        35  80.61789      101       78 0.07988624   7.40   7.29       86.23069
    ##   Platelets_max Platelets_min RespRate_diff RespRate_max RespRate_min SaO2_diff
    ## 1           221           221       7.34858           24           12  3.246079
    ## 2           226           164      16.65142           36           11  1.753921
    ## 3            84            72      13.65142           33           18  2.246079
    ## 4           391           315       7.34858           21           12  1.753921
    ## 5           109           109       6.65142           26           15  3.246079
    ## 6           276           219      27.65142           47           20  1.246079
    ##   SaO2_max SaO2_min SysABP_diff SysABP_max SysABP_min Temp_diff Temp_max
    ## 1       98       94          NA         NA         NA  1.874083     38.1
    ## 2       99       97     50.3105        135         66  2.474083     37.9
    ## 3       95       95          NA         NA         NA  2.025917     39.0
    ## 4       99       97          NA         NA         NA  1.874083     36.7
    ## 5       97       94          NA         NA         NA  1.174083     37.8
    ## 6       97       96     43.3105        152         73  1.174083     37.8
    ##   Temp_min TroponinI_diff TroponinI_max TroponinI_min TroponinT_diff
    ## 1     35.1      5.1429448           1.0           0.3      0.4785006
    ## 2     34.5     26.2570552          31.7          16.1      0.6485006
    ## 3     36.7     31.2570552          33.4          36.7      0.8814994
    ## 4     35.1      0.8570552           5.9           6.3      0.6485006
    ## 5     35.8      0.1570552           5.6           5.6      0.6085006
    ## 6     35.8      4.1429448           1.3           1.3      0.6385006
    ##   TroponinT_max TroponinT_min Urine_diff Urine_max Urine_min   WBC_diff WBC_max
    ## 1          0.58          0.19  800.78242       900        30  0.9331524    11.2
    ## 2          0.43          0.02  670.78242       770         0  4.7331524    13.1
    ## 3          1.55          1.41  310.78242       410        30  8.4331524     4.2
    ## 4          0.10          0.02  600.78242       700       100  3.3331524    11.5
    ## 5          0.06          0.37   83.21758       150        16  8.3331524     3.8
    ## 6          0.03          0.10 1100.78242      1200        40 11.8668476    24.0
    ##   WBC_min Weight_diff Weight_max Weight_min
    ## 1    11.2          NA         NA         NA
    ## 2     7.4    4.699878       80.6       76.0
    ## 3     3.7   23.999878       56.7       56.7
    ## 4     8.8    3.900122       84.6       84.6
    ## 5     3.8          NA         NA         NA
    ## 6    14.4   33.300122      114.0      114.0
    nrow(icu_patients_df1)
    ## [1] 2061
    # Note: status = TRUE = event occurred, status = FALSE = no event 
    
    # summary the number of days
    summary(icu_patients_df1$Days)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##       0     265    2408    1634    2408    2408
    hist(icu_patients_df1$Days)

    # check the status
    table(icu_patients_df1$Status)
    ## 
    ## FALSE  TRUE 
    ##  1288   773
    # check age and gender
    summary(icu_patients_df1$Age)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   16.00   52.00   67.00   64.41   78.00   90.00
    summary(icu_patients_df1$Gender)
    ## Female   Male 
    ##    913   1148

    The following variables of interest were chosen for the following reasons: Gender was chosen as we wanted to investigate whether this had an effect on patient outcomes alongside the ICU Type and Age. These variables were simple to test for and if significant, can provide some indication as to underlying causes of why death/ survival was the outcome. WBC variables were chosen as WBC counts can infer the presence of infection / chronic disease, as such we felt that this was an important set of variables to test for. Furthermore glucose levels were chosen as potentially significant variables to test for as they too can infer both underlying physiological/ renal system issues, or the presence of chronic disease (diabetes). Heart rate and respiratory rates were also chosen as variables of interest as these measures have well defined homeostatic ranges, which in turn, if out of this normal range, can infer the presence of an underlying physiological cause that could alter these values. Finally both the SAPS1 and SOFA variables were chosen as they represent an amalgamation of clinical measures that attempt to address sequential organ failure (SOFA) and a collection of physiological scores (SAPS1). As both these variables address many different clinical and physiological measures in one score, we felt it key to include these as variables of interest.

    # ----------------------------------------Gender----------------------------#
    summary(icu_patients_df1$Gender)
    ## Female   Male 
    ##    913   1148
    # ----------------------------------------ICUType----------------------------#
    summary(icu_patients_df1$ICUType)
    ##            Coronary Care Unit Cardiac Surgery Recovery Unit 
    ##                           297                           448 
    ##                   Medical ICU                  Surgical ICU 
    ##                           788                           528
    # ----------------------------------------Age----------------------------#
    summary(icu_patients_df1$Age)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   16.00   52.00   67.00   64.41   78.00   90.00
    ggplot(data = icu_patients_df1, aes(x = Age)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$Age), col="blue")+
       ggtitle("Age histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ----------------------------------------WBC_diff----------------------------#
    summary(icu_patients_df1$WBC_diff)
    ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    ##   0.03315   2.63315   4.53315   5.82079   7.23315 143.46685
    ggplot(data = icu_patients_df1, aes(x = WBC_diff)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$WBC_diff), col="blue")+
       ggtitle("WBC_diff histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ----------------------------------------WBC_min----------------------------#
    summary(icu_patients_df1$WBC_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    0.10    7.60   10.40   11.51   14.10  128.30
    ggplot(data = icu_patients_df1, aes(x = WBC_min)) + geom_histogram(fill = "red")+
        geom_vline(xintercept = mean(icu_patients_df1$WBC_min), col="blue")+
       ggtitle("WBC_min histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ----------------------------------------WBC_max----------------------------#
    summary(icu_patients_df1$WBC_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    0.10    9.30   12.30   13.95   16.90  155.60
    ggplot(data = icu_patients_df1, aes(x = WBC_max)) + geom_histogram(fill = "red")+
        geom_vline(xintercept = mean(icu_patients_df1$WBC_max), col="blue")+
       ggtitle("WBC_max histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ----------------------------------------Glucose_diff----------------------------#
    summary(icu_patients_df1$Glucose_diff)
    ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    ##    0.1445   23.8555   39.1445   57.0844   61.8555 1003.1445
    ggplot(data = icu_patients_df1, aes(x = Glucose_diff)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$Glucose_diff), col="blue")+
       ggtitle("Glucose_diff histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ----------------------------------------Glucose_min----------------------------#
    summary(icu_patients_df1$Glucose_min)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    24.0    98.0   117.0   124.8   141.0   632.0
    ggplot(data = icu_patients_df1, aes(x = Glucose_min)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$Glucose_min), col="blue")+
       ggtitle("Glucose_min histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ----------------------------------------Glucose_max----------------------------#
    summary(icu_patients_df1$Glucose_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    39.0   117.0   141.0   163.3   180.0  1143.0
    ggplot(data = icu_patients_df1, aes(x = Glucose_max)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$Glucose_max), col="blue")+
       ggtitle("Glucose_max histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ----------------------------------------SAPS1--------------------------------------#
    summary(icu_patients_df1$SAPS1)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    ##    1.00   11.00   15.00   14.96   19.00   34.00      96
    ggplot(data = icu_patients_df1, aes(x = SAPS1)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$SAPS1), col="blue")+
       ggtitle("SAPS1 histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    ## Warning: Removed 96 rows containing non-finite values (stat_bin).
    ## Warning: Removed 1 rows containing missing values (geom_vline).

    # ----------------------------------------SOFA---------------------------------------#
    summary(icu_patients_df1$SOFA)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##  -1.000   3.000   6.000   6.441   9.000  22.000
    ggplot(data = icu_patients_df1, aes(x = SOFA)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$SOFA), col="blue")+
       ggtitle("SOFA histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # --------------------------------------HR_max---------------------------------------#
    summary(icu_patients_df1$HR_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    44.0    91.0   104.0   106.6   119.0   300.0
    ggplot(data = icu_patients_df1, aes(x = HR_max)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$HR_max), col="blue")+
       ggtitle("HR_max histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # ---------------------------------------RespRate_max----------------------------------#
    summary(icu_patients_df1$RespRate_max)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   13.00   24.00   27.00   29.12   33.00   98.00
    ggplot(data = icu_patients_df1, aes(x = RespRate_max)) + geom_histogram(fill = "red")+
       geom_vline(xintercept = mean(icu_patients_df1$RespRate_max), col="blue")+
       ggtitle("RespRate_max histogram")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    Univariate Cox Models

    # remove all the NA values from the dataset
    icu_patients_df1 <- na.omit(icu_patients_df1)
    
    # fitting a simple Cox model with gender using coxph
    cox.gender <- coxph( Surv(Days, Status) ~ Gender, data=icu_patients_df1)
    summary(cox.gender)
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ Gender, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##              coef exp(coef) se(coef)     z Pr(>|z|)
    ## GenderMale 0.1183    1.1255   0.1477 0.801    0.423
    ## 
    ##            exp(coef) exp(-coef) lower .95 upper .95
    ## GenderMale     1.126     0.8885    0.8427     1.503
    ## 
    ## Concordance= 0.511  (se = 0.021 )
    ## Likelihood ratio test= 0.64  on 1 df,   p=0.4
    ## Wald test            = 0.64  on 1 df,   p=0.4
    ## Score (logrank) test = 0.64  on 1 df,   p=0.4
    # fitting a simple Cox model with ICUType using coxph 
    cox.ICUType <- coxph( Surv(Days, Status) ~ ICUType, data=icu_patients_df1)
    summary(cox.ICUType)  
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ ICUType, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##                                         coef exp(coef) se(coef)      z Pr(>|z|)
    ## ICUTypeCardiac Surgery Recovery Unit -0.1845    0.8315   0.2275 -0.811    0.417
    ## ICUTypeMedical ICU                    0.2261    1.2538   0.2132  1.061    0.289
    ## ICUTypeSurgical ICU                   0.3238    1.3823   0.2061  1.571    0.116
    ## 
    ##                                      exp(coef) exp(-coef) lower .95 upper .95
    ## ICUTypeCardiac Surgery Recovery Unit    0.8315     1.2026    0.5323     1.299
    ## ICUTypeMedical ICU                      1.2538     0.7976    0.8256     1.904
    ## ICUTypeSurgical ICU                     1.3823     0.7234    0.9230     2.070
    ## 
    ## Concordance= 0.555  (se = 0.022 )
    ## Likelihood ratio test= 7.16  on 3 df,   p=0.07
    ## Wald test            = 6.95  on 3 df,   p=0.07
    ## Score (logrank) test = 7.05  on 3 df,   p=0.07
    # fitting a simple Cox model with Age using coxph 
    cox.Age <- coxph( Surv(Days, Status) ~ Age, data=icu_patients_df1)
    summary(cox.Age)  
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ Age, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##         coef exp(coef) se(coef)     z Pr(>|z|)
    ## Age 0.001164  1.001165 0.005857 0.199    0.842
    ## 
    ##     exp(coef) exp(-coef) lower .95 upper .95
    ## Age     1.001     0.9988    0.9897     1.013
    ## 
    ## Concordance= 0.518  (se = 0.025 )
    ## Likelihood ratio test= 0.04  on 1 df,   p=0.8
    ## Wald test            = 0.04  on 1 df,   p=0.8
    ## Score (logrank) test = 0.04  on 1 df,   p=0.8
    # fitting a simple Cox model with WBC_max using coxph
    cox.WBC_max <- coxph( Surv(Days, Status) ~ WBC_max, data=icu_patients_df1)
    summary(cox.WBC_max)     # significant
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ WBC_max, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##             coef exp(coef) se(coef)     z Pr(>|z|)   
    ## WBC_max 0.018929  1.019109 0.006873 2.754  0.00588 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##         exp(coef) exp(-coef) lower .95 upper .95
    ## WBC_max     1.019     0.9812     1.005     1.033
    ## 
    ## Concordance= 0.548  (se = 0.025 )
    ## Likelihood ratio test= 6.43  on 1 df,   p=0.01
    ## Wald test            = 7.59  on 1 df,   p=0.006
    ## Score (logrank) test = 7.64  on 1 df,   p=0.006
    # fitting a simple Cox model with WBC_diff using coxph
    cox.WBC_diff <- coxph( Surv(Days, Status) ~ WBC_diff, data=icu_patients_df1)
    summary(cox.WBC_diff) 
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ WBC_diff, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##              coef exp(coef) se(coef)     z Pr(>|z|)  
    ## WBC_diff 0.017377  1.017529 0.008527 2.038   0.0416 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##          exp(coef) exp(-coef) lower .95 upper .95
    ## WBC_diff     1.018     0.9828     1.001     1.035
    ## 
    ## Concordance= 0.531  (se = 0.026 )
    ## Likelihood ratio test= 3.51  on 1 df,   p=0.06
    ## Wald test            = 4.15  on 1 df,   p=0.04
    ## Score (logrank) test = 4.21  on 1 df,   p=0.04
    # fitting a simple Cox model with Glucose_max using coxph
    cox.Glucose_max <- coxph( Surv(Days, Status) ~ Glucose_max, data=icu_patients_df1)
    summary(cox.Glucose_max) # significant
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ Glucose_max, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##                  coef exp(coef)  se(coef)     z Pr(>|z|)   
    ## Glucose_max 0.0021497 1.0021520 0.0007545 2.849  0.00438 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##             exp(coef) exp(-coef) lower .95 upper .95
    ## Glucose_max     1.002     0.9979     1.001     1.004
    ## 
    ## Concordance= 0.563  (se = 0.025 )
    ## Likelihood ratio test= 6.48  on 1 df,   p=0.01
    ## Wald test            = 8.12  on 1 df,   p=0.004
    ## Score (logrank) test = 8.14  on 1 df,   p=0.004
    # fitting a simple Cox model with Glucose_diff using coxph
    cox.Glucose_diff <- coxph( Surv(Days, Status) ~ Glucose_diff, data=icu_patients_df1)
    summary(cox.Glucose_diff) # significant
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ Glucose_diff, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##                   coef exp(coef)  se(coef)    z Pr(>|z|)   
    ## Glucose_diff 0.0025905 1.0025939 0.0007899 3.28  0.00104 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##              exp(coef) exp(-coef) lower .95 upper .95
    ## Glucose_diff     1.003     0.9974     1.001     1.004
    ## 
    ## Concordance= 0.574  (se = 0.023 )
    ## Likelihood ratio test= 7.91  on 1 df,   p=0.005
    ## Wald test            = 10.76  on 1 df,   p=0.001
    ## Score (logrank) test = 11.21  on 1 df,   p=8e-04
    # fitting a simple Cox model with SAPS1 using coxph
    cox.SAPS1 <- coxph( Surv(Days, Status) ~ SAPS1, data=icu_patients_df1)
    summary(cox.SAPS1) # significant
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ SAPS1, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##          coef exp(coef) se(coef)     z Pr(>|z|)  
    ## SAPS1 0.03439   1.03499  0.01450 2.372   0.0177 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##       exp(coef) exp(-coef) lower .95 upper .95
    ## SAPS1     1.035     0.9662     1.006     1.065
    ## 
    ## Concordance= 0.578  (se = 0.023 )
    ## Likelihood ratio test= 5.67  on 1 df,   p=0.02
    ## Wald test            = 5.63  on 1 df,   p=0.02
    ## Score (logrank) test = 5.62  on 1 df,   p=0.02
    # fitting a simple Cox model with SOFA using coxph
    cox.SOFA <- coxph( Surv(Days, Status) ~ SOFA, data=icu_patients_df1)
    summary(cox.SOFA) # significant
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ SOFA, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##         coef exp(coef) se(coef)     z Pr(>|z|)   
    ## SOFA 0.05041   1.05170  0.01903 2.649  0.00808 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##      exp(coef) exp(-coef) lower .95 upper .95
    ## SOFA     1.052     0.9508     1.013     1.092
    ## 
    ## Concordance= 0.59  (se = 0.024 )
    ## Likelihood ratio test= 7.1  on 1 df,   p=0.008
    ## Wald test            = 7.02  on 1 df,   p=0.008
    ## Score (logrank) test = 7.04  on 1 df,   p=0.008
    # fitting a simple Cox model with HR_max using coxph
    cox.HR_max <- coxph( Surv(Days, Status) ~ HR_max, data=icu_patients_df1)
    summary(cox.HR_max)  # significant
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ HR_max, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##            coef exp(coef) se(coef)     z Pr(>|z|)  
    ## HR_max 0.005178  1.005191 0.002198 2.355   0.0185 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##        exp(coef) exp(-coef) lower .95 upper .95
    ## HR_max     1.005     0.9948     1.001      1.01
    ## 
    ## Concordance= 0.563  (se = 0.024 )
    ## Likelihood ratio test= 4.73  on 1 df,   p=0.03
    ## Wald test            = 5.55  on 1 df,   p=0.02
    ## Score (logrank) test = 5.5  on 1 df,   p=0.02
    # fitting a simple Cox model with RespRate_max using coxph
    cox.RespRate_max <- coxph( Surv(Days, Status) ~ RespRate_max, data=icu_patients_df1)
    summary(cox.RespRate_max)  
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ RespRate_max, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##                 coef exp(coef) se(coef)     z Pr(>|z|)
    ## RespRate_max 0.01219   1.01227  0.00822 1.483    0.138
    ## 
    ##              exp(coef) exp(-coef) lower .95 upper .95
    ## RespRate_max     1.012     0.9879    0.9961     1.029
    ## 
    ## Concordance= 0.539  (se = 0.027 )
    ## Likelihood ratio test= 2.16  on 1 df,   p=0.1
    ## Wald test            = 2.2  on 1 df,   p=0.1
    ## Score (logrank) test = 2.21  on 1 df,   p=0.1

    Multivariate Cox Models

    # fitting series of multivariate Cox proportional hazards models 
    cox.mv1 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max, data=icu_patients_df1)
    summary(cox.mv1)
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max, 
    ##     data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##            coef exp(coef) se(coef)     z Pr(>|z|)  
    ## SAPS1   0.01057   1.01063  0.01835 0.576   0.5645  
    ## SOFA    0.03648   1.03715  0.02366 1.541   0.1232  
    ## WBC_max 0.01505   1.01516  0.00728 2.067   0.0388 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##         exp(coef) exp(-coef) lower .95 upper .95
    ## SAPS1       1.011     0.9895    0.9749     1.048
    ## SOFA        1.037     0.9642    0.9901     1.086
    ## WBC_max     1.015     0.9851    1.0008     1.030
    ## 
    ## Concordance= 0.597  (se = 0.022 )
    ## Likelihood ratio test= 11.95  on 3 df,   p=0.008
    ## Wald test            = 12.63  on 3 df,   p=0.006
    ## Score (logrank) test = 12.81  on 3 df,   p=0.005
    cox.mv2 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max, data=icu_patients_df1)
    summary(cox.mv2)
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + 
    ##     HR_max, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##              coef exp(coef)  se(coef)      z Pr(>|z|)  
    ## SAPS1   -0.001324  0.998677  0.019245 -0.069   0.9451  
    ## SOFA     0.043489  1.044448  0.023215  1.873   0.0610 .
    ## WBC_max  0.016200  1.016332  0.007437  2.178   0.0294 *
    ## HR_max   0.005377  1.005391  0.002585  2.080   0.0375 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##         exp(coef) exp(-coef) lower .95 upper .95
    ## SAPS1      0.9987     1.0013    0.9617     1.037
    ## SOFA       1.0444     0.9574    0.9980     1.093
    ## WBC_max    1.0163     0.9839    1.0016     1.031
    ## HR_max     1.0054     0.9946    1.0003     1.010
    ## 
    ## Concordance= 0.598  (se = 0.024 )
    ## Likelihood ratio test= 15.83  on 4 df,   p=0.003
    ## Wald test            = 16.38  on 4 df,   p=0.003
    ## Score (logrank) test = 16.55  on 4 df,   p=0.002
    cox.mv3 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max, data=icu_patients_df1)
    summary(cox.mv3)
    ## Call:
    ## coxph(formula = Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + 
    ##     HR_max + Glucose_max, data = icu_patients_df1)
    ## 
    ##   n= 190, number of events= 190 
    ## 
    ##                   coef  exp(coef)   se(coef)      z Pr(>|z|)  
    ## SAPS1       -0.0085564  0.9914801  0.0194297 -0.440   0.6597  
    ## SOFA         0.0436689  1.0446364  0.0233888  1.867   0.0619 .
    ## WBC_max      0.0177555  1.0179140  0.0076591  2.318   0.0204 *
    ## HR_max       0.0048966  1.0049086  0.0025927  1.889   0.0589 .
    ## Glucose_max  0.0017174  1.0017188  0.0007929  2.166   0.0303 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##             exp(coef) exp(-coef) lower .95 upper .95
    ## SAPS1          0.9915     1.0086    0.9544     1.030
    ## SOFA           1.0446     0.9573    0.9978     1.094
    ## WBC_max        1.0179     0.9824    1.0027     1.033
    ## HR_max         1.0049     0.9951    0.9998     1.010
    ## Glucose_max    1.0017     0.9983    1.0002     1.003
    ## 
    ## Concordance= 0.614  (se = 0.023 )
    ## Likelihood ratio test= 19.84  on 5 df,   p=0.001
    ## Wald test            = 21.37  on 5 df,   p=7e-04
    ## Score (logrank) test = 21.75  on 5 df,   p=6e-04
    # compare models
    print(anova(cox.mv1, cox.mv2))
    ## Analysis of Deviance Table
    ##  Cox model: response is  Surv(Days, Status)
    ##  Model 1: ~ SAPS1 + SOFA + WBC_max
    ##  Model 2: ~ SAPS1 + SOFA + WBC_max + HR_max
    ##    loglik  Chisq Df P(>|Chi|)  
    ## 1 -804.50                      
    ## 2 -802.56 3.8874  1   0.04865 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    print(anova(cox.mv2, cox.mv3))
    ## Analysis of Deviance Table
    ##  Cox model: response is  Surv(Days, Status)
    ##  Model 1: ~ SAPS1 + SOFA + WBC_max + HR_max
    ##  Model 2: ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max
    ##    loglik  Chisq Df P(>|Chi|)  
    ## 1 -802.56                      
    ## 2 -800.56 4.0039  1   0.04539 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    From comparing models with the anova() function, we found that cox.mv2 has significant p-value 0.049, thus we choose cox.mv2. Hence make further model comparisons to be sure all these variables should be retained using drop1()

    cox.mv2 <- coxph(Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max , data=icu_patients_df1)
    print(drop1(cox.mv2, test="Chisq"))
    ## Single term deletions
    ## 
    ## Model:
    ## Surv(Days, Status) ~ SAPS1 + SOFA + WBC_max + HR_max + Glucose_max
    ##             Df    AIC    LRT Pr(>Chi)  
    ## <none>         1611.1                  
    ## SAPS1        1 1609.3 0.1938  0.65978  
    ## SOFA         1 1612.6 3.4845  0.06194 .
    ## WBC_max      1 1613.9 4.8150  0.02821 *
    ## HR_max       1 1612.4 3.2384  0.07193 .
    ## Glucose_max  1 1613.1 4.0039  0.04539 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Remove SAPS1 and SOFA from the model after checking drop1() result

    cox.refit <- coxph(Surv(Days, Status) ~ WBC_max + Glucose_max, data=icu_patients_df1)
    print(drop1(cox.refit, test="Chisq"))
    ## Single term deletions
    ## 
    ## Model:
    ## Surv(Days, Status) ~ WBC_max + Glucose_max
    ##             Df    AIC    LRT Pr(>Chi)  
    ## <none>         1612.0                  
    ## WBC_max      1 1616.5 6.4378  0.01117 *
    ## Glucose_max  1 1616.5 6.4915  0.01084 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    The outputs from univariate cox show that the demographic variables like age and gender are not statistically significant, thus, we would not keep it in the models. And there is no significance regrading ICUType is also different with our initial expected.

    Compared to the cox.refit model, cox.mv3 has a lower AIC value, so we prefer the cox.mv2 model.

    Save, knit and submit

    Reminder: don’t forget to save this file, to knit it to check that everything works, and then submit via the drop box in OpenLearning.

    Submit your assignment

    When you have finished, and are satisfied with your assignment solutions, and this file knits without errors and the output looks the way you want, then you should submit via the drop box in OpenLearning.

    Problems?

    If you encounter problems with any part of the process described above, please contact the course convenor via OpenLearning as soon as possible so that the issues can be resolved in good time, and well before the assignment is due.

    Additional Information

    Each task attracts the indicated number of marks (out of a total of 30 marks for the assignment). The instructions are deliberately open-ended and less prescriptive than the individual assignments to allow you some latitude in what you do and how you go about the task. However, to complete the tasks and gain full marks, you only need to replicate or repeat the steps covered in the course - if you do most or all of the things described in the revalant chapters of the HDAT9600 course, full marks will be awarded.

    Note also that with respect to the model fitting, there are no right or wrong answers when it comes to variable selection and other aspects of model specification. Deep understanding of the underlying medical concepts which govern patient treatment and outcomes in ICUs is not required or assumed, although you should try to gain some understanding of each variable using the links provided. You will not be marked down if your medical justifications are not exactly correct or complete, but do you best, and don’t hesitate to seek help from the course convenor.